Take-home Assignment 3: Predicting HDB Public Housing Resale Prices using Geographically Weighted Methods

Author

Rhonda Ho Kah Yee

Published

March 9, 2023

Modified

March 26, 2023

Overview

Hello! This is Rhonda Ho’s take-home Assignment 3 for the IS415 module.

To view/hide all the code at once, please click on the “</> code” tab beside the title of this html document and select the option to view/hide the code.

The full details of this assignment can be found here.

Task

In this take-home exercise, I am tasked to predict HDB resale prices at the sub-market level (i.e. HDB 3-room, HDB 4-room and HDB 5-room) for the months of January and February 2023 in Singapore. The predictive models must be built using the conventional OLS method and GWR methods. I am also required to compare the performance of the conventional OLS method against the geographically weighted methods.
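Since the comparison between OLS and the geographically weighted methods ultimately rests on prediction error, a minimal base-R stand-in for Metrics::rmse() illustrates the metric we will use (the prices and predictions below are made-up numbers, purely for illustration):

```r
# RMSE = square root of the mean squared prediction error; lower is better.
# Base-R equivalent of Metrics::rmse(actual, predicted).
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# hypothetical resale prices and model predictions (illustrative only)
actual   <- c(500000, 520000, 540000)
ols_pred <- c(510000, 515000, 530000)
gwr_pred <- c(502000, 519000, 538000)

rmse(actual, ols_pred)
rmse(actual, gwr_pred)  # the model with the lower RMSE predicts better
```

The same one-number summary lets us rank OLS, GWR and random-forest models on the held-out test months.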

The Data

Below is the list of datasets I have collected.

Type                  Dataset                                  File Format
Aspatial              Resale Flat Prices                       .csv
Geospatial            Master Plan 2014 Subzone Boundary (Web)  .kml
Geospatial            Bus Stop Location                        .kml, .shp
Geospatial            Train Station                            .kml, .shp
Geospatial            School Directory and Information         .csv
Geospatial-Extracted  Child Care Services                      .shp
Geospatial-Extracted  Eldercare Services                       .rds
Geospatial-Extracted  Hawker Centres                           .rds
Geospatial-Extracted  Kindergartens                            .rds
Geospatial-Extracted  Parks                                    .rds
Geospatial            Supermarket                              .geojson
Geospatial            Shopping Malls                           .csv

Getting Started

The following packages will be used:

  • sf: used for importing, managing, and processing geospatial data

  • tidyverse: a collection of packages for data science tasks

  • tmap: used for creating thematic maps, such as choropleth and bubble maps

  • sfdep: used to create spatial weights matrix objects, global and local spatial autocorrelation statistics and related calculations (e.g. spatially lagged attributes)

  • onemapsgapi: used to query Singapore-specific spatial data, alongside additional functionalities.

  • units: used for manipulating numeric vectors that have physical measurement units associated with them

  • matrixStats: a set of high-performing functions for operating on rows and columns of matrices

  • jsonlite: a JSON parser that can convert from JSON to the appropriate R data types

  • olsrr: used for building least squares regression models

  • httr : used to make API calls, such as a GET request

  • ggpubr: used for multivariate data visualisation & analysis

  • GWmodel: provides a collection of localised spatial statistical methods, such as summary statistics, principal components analysis, discriminant analysis and various forms of GW regression

  • SpatialML: allows for a geographically weighted random forest regression, including a function to find the optimal bandwidth

  • ranger: a fast implementation of Random Forests, particularly suited for high-dimensional data

  • cowplot: a simple add-on to ggplot and provides various features that help with creating publication-quality figures

  • Metrics: allows us to calculate the RMSE of our models

Code
pacman::p_load(readxl, sf, tidyverse, units, matrixStats, olsrr, gdata, tmap, sfdep, jsonlite, onemapsgapi, rvest, ranger, SpatialML, GWmodel, httr, cowplot, Metrics)

Importing Data and Wrangling

Before I can perform my tasks, I need to derive the following independent variables from my datasets.

Structural factors:

  1. Area of the unit

  2. Floor level

  3. Remaining lease

  4. Age of the unit

Locational factors:

  1. Proximity to CBD

  2. Proximity to eldercare

  3. Proximity to foodcourt/hawker centres

  4. Proximity to MRT

  5. Proximity to park

  6. Proximity to good primary school

  7. Proximity to shopping mall

  8. Proximity to supermarket

  9. Number of kindergartens within 350m

  10. Number of childcare centres within 350m

  11. Number of bus stops within 350m

  12. Number of primary schools within 1km
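Since the data will be projected to SVY21 (EPSG:3414), whose units are metres, both the proximity factors and the within-radius count factors reduce to distance computations. A toy base-R sketch with made-up coordinates shows the idea; on sf objects, the same thing is expressed with st_distance():

```r
# hypothetical flat and amenity coordinates (easting, northing in metres,
# as in a projected CRS such as SVY21 / EPSG:3414)
flats <- matrix(c(30000, 40000,
                  31000, 39000), ncol = 2, byrow = TRUE)
amenities <- matrix(c(30100, 40000,
                      35000, 45000,
                      30900, 39050), ncol = 2, byrow = TRUE)

# pairwise Euclidean distance matrix: one row per flat, one column per amenity
dist_mat <- t(apply(flats, 1, function(f) {
  sqrt((amenities[, 1] - f[1])^2 + (amenities[, 2] - f[2])^2)
}))

prox_nearest      <- apply(dist_mat, 1, min)   # proximity factors (metres)
count_within_350m <- rowSums(dist_mat <= 350)  # count factors
```

Each proximity variable is the row minimum of such a matrix, and each count variable is the number of columns within the chosen radius.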

Aspatial Data

Resale Flat Prices

Code
resale <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

glimpse(resale)
Rows: 149,071
Columns: 11
$ month               <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
$ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type           <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block               <chr> "406", "108", "602", "465", "601", "150", "447", "…
$ street_name         <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
$ storey_range        <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
$ floor_area_sqm      <dbl> 44, 67, 67, 68, 67, 68, 68, 67, 68, 67, 68, 67, 67…
$ flat_model          <chr> "Improved", "New Generation", "New Generation", "N…
$ lease_commence_date <dbl> 1979, 1978, 1980, 1980, 1980, 1981, 1979, 1976, 19…
$ remaining_lease     <chr> "61 years 04 months", "60 years 07 months", "62 ye…
$ resale_price        <dbl> 232000, 250000, 262000, 265000, 265000, 275000, 28…

Having a glimpse of our data, we can observe that this dataset has a total of 11 columns and 149071 rows. The columns consist of month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease, resale_price.

For this assignment, the dataset will focus on:

  • Transaction period: October 2022 to February 2023

  • Training dataset period: October 2022 to December 2022

  • Test dataset period: January 2023 to February 2023

  • Type of flat: 5-room flats

The code chunk filters our dataset accordingly.

Code
resale<- resale %>% 
  filter(flat_type == "5 ROOM") %>%
  filter(month >= "2022-10" & month <= "2023-02")
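The later train/test split by transaction month can be sketched with the same string comparison on the month column: because the "YYYY-MM" values are zero-padded, they sort chronologically. The toy data frame below stands in for the real resale tibble:

```r
# stand-in for the filtered resale tibble, with the character month column
resale <- data.frame(
  month        = c("2022-10", "2022-11", "2022-12", "2023-01", "2023-02"),
  resale_price = c(600000, 610000, 620000, 630000, 640000)
)

# zero-padded "YYYY-MM" strings compare chronologically, so plain string
# comparisons are enough to split the periods
train <- subset(resale, month >= "2022-10" & month <= "2022-12")
test  <- subset(resale, month >= "2023-01" & month <= "2023-02")

nrow(train)  # 3
nrow(test)   # 2
```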

Based on my senior’s experience, “ST.” is usually written as “SAINT” instead - for example, St. Luke’s Primary School is written as Saint Luke’s Primary School. To address this, we’ll replace such occurrences:

Code
resale$street_name <- gsub("ST\\.", "SAINT", resale$street_name)

Subsequently, as our dataset is missing the coordinates of each location, I created a function, geocode(), which calls the OneMap API to retrieve the geometry of each location.

Code
#library(httr)
geocode <- function(block, streetname) {
  base_url <- "https://developers.onemap.sg/commonapi/search"
  address <- paste(block, streetname, sep = " ")
  query <- list("searchVal" = address, 
                "returnGeom" = "Y",
                "getAddrDetails" = "N",
                "pageNum" = "1")
  
  res <- GET(base_url, query = query)
  restext<-content(res, as="text")
  
  output <- fromJSON(restext)  %>% 
    as.data.frame %>%
    select(results.LATITUDE, results.LONGITUDE)

  return(output)
}

While exploring the API, I realised that the best search parameter would be to combine the block column with its street_name. Thus, the geocode() function takes in both the block and street name.

Code
resale$LATITUDE <- 0
resale$LONGITUDE <- 0

for (i in 1:nrow(resale)){
  temp_output <- geocode(resale[i, 4], resale[i, 5])
  
  resale$LATITUDE[i] <- temp_output$results.LATITUDE
  resale$LONGITUDE[i] <- temp_output$results.LONGITUDE
}

To reuse the resale dataset without calling the API again, I saved it into an RDS file.

Code
write_rds(resale, "data/rds/resale.rds")
Code
resale_rds<-readRDS("data/rds/resale.rds")

Structural Data

In this section, I will be pre-processing the structural data that I need for my tasks.

Floor Level

Let’s first take a look at the storey_range values.

Code
sort(unique(resale_rds$storey_range))
 [1] "01 TO 03" "04 TO 06" "07 TO 09" "10 TO 12" "13 TO 15" "16 TO 18"
 [7] "19 TO 21" "22 TO 24" "25 TO 27" "28 TO 30" "31 TO 33" "34 TO 36"
[13] "37 TO 39" "40 TO 42"

We can see that there are 14 storey range categories. The following code chunk recodes the categorical values to numerical order by assigning 1 to the first range 01 TO 03 and 14 to the last range 40 TO 42.

Code
storey <- sort(unique(resale_rds$storey_range))
storey_order <- 1:length(storey)
storey_range_order <- data.frame(storey, storey_order)
storey_range_order
     storey storey_order
1  01 TO 03            1
2  04 TO 06            2
3  07 TO 09            3
4  10 TO 12            4
5  13 TO 15            5
6  16 TO 18            6
7  19 TO 21            7
8  22 TO 24            8
9  25 TO 27            9
10 28 TO 30           10
11 31 TO 33           11
12 34 TO 36           12
13 37 TO 39           13
14 40 TO 42           14

Next, I will join it to the original resale data frame.

Code
resale_rds <- left_join(resale_rds,  storey_range_order, by = c("storey_range" = "storey"))
Remaining Lease

Currently, remaining_lease is in string format, but we need it to be numeric. Thus, we split the string into its year and month parts, then replace the original values with the remaining lease calculated in years.

Code
#e.g of resale$remaining_lease - 53 years 10 months
str_list <- str_split(resale_rds$remaining_lease, " ")

#after splitting, [53, years, 10, months]
for (i in 1:length(str_list)){
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      resale_rds$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    resale_rds$remaining_lease[i] <- year
  }
}
Age of Unit

To get the age of the unit, I take the current year, i.e. 2023, and subtract the unit’s lease commencement year.

Code
# age of unit = 2023 - lease commencement year (vectorised, no loop needed)
resale_rds$lease_commence_date <- 2023 - resale_rds$lease_commence_date

Finally, we can convert it into an sf object.

Code
resale_sf <- st_as_sf(resale_rds, 
                      coords = c("LONGITUDE", 
                                 "LATITUDE"), 
                      crs=4326) %>%
  st_transform(crs = 3414)

Data Pre-processing

Under this section, we will check for:

  • missing values

  • the correct coordinate system

  • invalid geometries

Check for Missing Values

Code
#resale_sf
sum(is.na(resale_sf))
[1] 0

We can observe that we have no missing values.

Check for Correct Coordinates System

Code
#resale_sf
st_crs(resale_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

Our dataset has the correct Singapore projected coordinate system, EPSG:3414.

Check for Invalid Geometries

Code
#resale_sf
length(which(st_is_valid(resale_sf) == FALSE))
[1] 0

Based on the output, our geometries are valid.

Once again, I saved the processed resale_sf to an RDS file.

Code
write_rds(resale_sf, "data/rds/resale_sf.rds")

Geospatial Data

Master Plan 2014 Subzone Boundary

The code chunk below retrieves the geospatial polygon data for Singapore’s subzone boundaries from the Master Plan 2014.

Code
mpsz <- st_read(dsn="data/geospatial/MP14_SUBZONE_WEB_PL.kml") %>%
  st_transform(crs = 3414)
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `C:\RhondaHO\IS415-GAA\Take-home_Assgn\Take-home_Assgn3\data\geospatial\MP14_SUBZONE_WEB_PL.kml' 
  using driver `KML'
Simple feature collection with 323 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XYZ
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

Locational Factors Extracted via OneMap API Token

For some of the locational factors, I will be utilising the OneMap API to retrieve their geometry data.

One method I discovered was to use a token to retrieve the geometry of locational factors based on their related theme. But before we can proceed, we need to register for an account here and retrieve the token.

Code
token <- get_token("user@example.com", "password")

The code chunk below helps us view the available themes. As a token is needed, I first saved the output in an rds file.

Code
#retrieve available themes that we can refer to
avail_themes <-search_themes(token)
write_rds(avail_themes, "data/rds/available_themes.rds")

After reading the saved file, I sorted the themes in alphabetical order for easier reference.

Code
#read the file for available themes
avail_themes<-readRDS("data/rds/available_themes.rds")

#sort by alphabetical order
avail_themes<-avail_themes[order(avail_themes$THEMENAME),]
avail_themes
# A tibble: 193 × 5
   THEMENAME                                     QUERYNAME ICON  CATEG…¹ THEME…²
   <chr>                                         <chr>     <chr> <chr>   <chr>  
 1 ABC Waters Completed                          abcwater… abcl… Enviro… PUBLIC…
 2 ABC Waters Construction                       abcwater… tend… Enviro… PUBLIC…
 3 ABC Waters Sites                              abcwater… abcl… Enviro… PUBLIC…
 4 Active Cemeteries                             activece… circ… Enviro… NATION…
 5 Aedes Mosquito Breeding Habitats - Central    breeding… Pict… Health  NATION…
 6 Aedes Mosquito Breeding Habitats - North East breeding… Pict… Health  NATION…
 7 Aedes Mosquito Breeding Habitats - North West breeding… Pict… Health  NATION…
 8 Aedes Mosquito Breeding Habitats - South East breeding… Pict… Health  NATION…
 9 Aedes Mosquito Breeding Habitats - South West breeding… Pict… Health  NATION…
10 After Death Facilities                        afterdea… coff… Enviro… NATION…
# … with 183 more rows, and abbreviated variable names ¹​CATEGORY, ²​THEME_OWNER

Looking through the available themes from the OneMap API, some of the themes relevant to our tasks are:

Theme Name           Query Name
Child Care Services  childcare
Eldercare Services   eldercare
Hawker Centres_New   hawkercentre_new
Kindergartens        kindergartens
Parks                nationalparks

(Note: there are other similar themes, such as Hawker Centres and Healthier Hawker Centres for hawker centres, and Parks@SG and NParks Parks and Nature Reserves for parks.)

In the following code chunks, I save the data for each locational factor I am interested in.

The steps taken are:

  1. Retrieve data such as the geometry and name of the place/amenity by using onemap’s get_theme() function, which takes in a theme (i.e. query name), and store the data in a df

  2. Convert the df to an sf object by using the st_as_sf() function

  3. Transform the crs information to the Singapore coordinate system, i.e. 3414, by using the st_transform() function

  4. Write the output to a file, using the st_write() function for shapefiles or the write_rds() function for RDS files

Code
#| eval: false
#theme: childcare

#retrieve the data such as the geometry and name  accordingly to the theme
childcare_tibble <- get_theme(token, "childcare")

# to convert a data frame of coordinates to an sf object and transform the crs information and create a shapefile for it
childcare_sf <- st_as_sf(childcare_tibble, coords=c("Lng", "Lat"), 
                        crs=4326) %>% 
  st_transform(crs = 3414)
st_write(obj = childcare_sf,
         dsn = "data/geospatial",
         layer = "childcare",
         driver = "ESRI Shapefile",
         append=FALSE)
Code
#| eval: false
#theme: eldercare

#retrieve the data such as the geometry and name based accordingly to the theme
eldercare_tibble <- get_theme(token, "eldercare")

# to convert a data frame of coordinates to an sf object and transform the crs information and create a shapefile for it
eldercare_sf <- st_as_sf(eldercare_tibble, coords=c("Lng", "Lat"), 
                        crs=4326) %>% 
  st_transform(crs = 3414)

write_rds(eldercare_sf, "data/rds/eldercare.rds")
Code
#| eval: false
#theme: new hawker centres

#retrieve the data such as the geometry and name based accordingly to the theme
hawkercentre_new_tibble <- get_theme(token, "hawkercentre_new")

# to convert a data frame of coordinates to an sf object and transform the crs information and create a shapefile for it
hawkercentre_new_sf <- st_as_sf(hawkercentre_new_tibble, coords=c("Lng", "Lat"), 
                        crs=4326) %>% 
  st_transform(crs = 3414)

write_rds(hawkercentre_new_sf, "data/rds/hawker_new.rds")

#issues writing in this manner
#st_write(obj = hawkercentre_new_sf,
#         dsn = "data/geospatial",
#         layer = "hawkercentre_new",
#        driver = "ESRI Shapefile",
#         append=FALSE)
Code
#| eval: false
#theme: kindergartens

#retrieve the data such as the geometry and name based accordingly to the theme
kindergartens_tibble <- get_theme(token, "kindergartens")

# to convert a data frame of coordinates to an sf object and transform the crs information and create a shapefile for it
kindergartens_sf <- st_as_sf(kindergartens_tibble, coords=c("Lng", "Lat"), 
                        crs=4326) %>% 
  st_transform(crs = 3414)

write_rds(kindergartens_sf, "data/rds/kindergartens.rds")
Code
#| eval: false
#theme: parks

#retrieve the data such as the geometry and name based accordingly to the theme
nationalparks_tibble <- get_theme(token, "nationalparks")

# to convert a data frame of coordinates to an sf object and transform the crs information and create a shapefile for it
nationalparks_sf <- st_as_sf(nationalparks_tibble, coords=c("Lng", "Lat"), 
                        crs=4326) %>% 
  st_transform(crs = 3414)

write_rds(nationalparks_sf, "data/rds/nationalparks.rds")

Import the following to be used for our tasks later.

Code
childcare_sf <-st_read("data/geospatial", layer="childcare")
Reading layer `childcare' from data source 
  `C:\RhondaHO\IS415-GAA\Take-home_Assgn\Take-home_Assgn3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 1925 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11810.03 ymin: 25596.33 xmax: 45404.24 ymax: 49300.88
Projected CRS: SVY21 / Singapore TM
Code
eldercare_sf<- readRDS("data/rds/eldercare.rds")
hawkercentre_new_sf <- readRDS("data/rds/hawker_new.rds")
kindergartens_sf <- readRDS("data/rds/kindergartens.rds")
nationalparks_sf <- readRDS("data/rds/nationalparks.rds")

CBD Area

For this assignment, I consider the CBD to be the Downtown Core, so I will be using the coordinates of the Downtown Core.

Code
lat= c(1.287953)
lng= c(103.851784)

cbd_sf <- data.frame(lat, lng) %>%
  st_as_sf(coords = c("lng", "lat"), crs=4326) %>%
  st_transform(crs=3414)

Supermarket

Code
supermarket_sf <- st_read("data/geospatial/supermarkets-geojson.geojson") 
Reading layer `supermarkets-geojson' from data source 
  `C:\RhondaHO\IS415-GAA\Take-home_Assgn\Take-home_Assgn3\data\geospatial\supermarkets-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Code
supermarket_sf <- supermarket_sf %>%
  st_transform(crs = 3414)

Bus Stop

Code
bus_stop<- st_read(dsn = "data/geospatial", layer = "BusStop")
Reading layer `BusStop' from data source 
  `C:\RhondaHO\IS415-GAA\Take-home_Assgn\Take-home_Assgn3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5159 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Code
bus_stop_sf <- bus_stop %>%
  st_transform(crs = 3414)

MRT/LRT

Code
mrt = st_read(dsn = "data/geospatial/", layer = "Train_Station_Exit_Layer")
Reading layer `Train_Station_Exit_Layer' from data source 
  `C:\RhondaHO\IS415-GAA\Take-home_Assgn\Take-home_Assgn3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 562 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
Code
mrt_sf <- mrt %>%
  st_transform(crs = 3414)

Primary School

While searching for a dataset of primary school locations, I came across this dataset from data.gov.sg, which contains general information on schools in Singapore. By filtering the mainlevel_code column, which represents the type of school, for Primary, I can extract the addresses and postal codes of primary schools in Singapore.

Code
primary_school <- read_csv("data/geospatial/general-information-of-schools.csv")

primary_school <- primary_school %>%
  filter(mainlevel_code == "PRIMARY") %>%
  select(school_name, address, postal_code)

glimpse(primary_school)
Rows: 183
Columns: 3
$ school_name <chr> "ADMIRALTY PRIMARY SCHOOL", "AHMAD IBRAHIM PRIMARY SCHOOL"…
$ address     <chr> "11   WOODLANDS CIRCLE", "10   YISHUN STREET 11", "100  Br…
$ postal_code <chr> "738907", "768643", "579646", "159016", "544969", "569785"…

Based on the output, we can observe that there are 183 primary schools in Singapore. However, in 2022, Juying Primary School was merged with Pioneer Primary School, and it can no longer be found via the API. Thus, I decided to remove it from our data.

Code
primary_school<-primary_school %>%  
  filter(school_name!='JUYING PRIMARY SCHOOL')

Once again, we can use the geocode() function we created earlier to extract the geometry of each primary school by its school name.

Code
primary_school$LATITUDE <- 0
primary_school$LONGITUDE <- 0

for (i in 1:nrow(primary_school)){
  temp_output <- geocode(primary_school[i, 1],"")

  primary_school$LATITUDE[i] <- temp_output$results.LATITUDE
  primary_school$LONGITUDE[i] <- temp_output$results.LONGITUDE
}

To reuse the primary school data without calling the API again, I saved it in an RDS file.

Code
write_rds(primary_school, "data/rds/primary_school.rds")
Code
primary_school_rds<-readRDS("data/rds/primary_school.rds")

Next, we can convert our df into an sf object.

Code
primary_school_sf <- st_as_sf(primary_school_rds,
                    coords = c("LONGITUDE", 
                               "LATITUDE"),
                    crs=4326) %>%
  st_transform(crs = 3414)

Besides this, we need to determine what counts as a good primary school (which is necessary for one of our tasks). On the assumption that a good primary school is a popular one, I picked out the top 10 most popular primary schools, referencing their popularity from this website.

Code
popular_primary_schools <-c("Pei Hwa Presbyterian Primary School",
                            "Gongshang Primary School",
                            "Riverside Primary School",
                            "Red Swastika School",
                            "Punggol Green Primary School",
                            "Princess Elizabeth Primary School",
                            "Westwood Primary School",
                            "St. Hilda’s Primary School",
                            "Catholic High School (Primary Section)",
                            "Ai Tong School")
popular_primary_schools
 [1] "Pei Hwa Presbyterian Primary School"   
 [2] "Gongshang Primary School"              
 [3] "Riverside Primary School"              
 [4] "Red Swastika School"                   
 [5] "Punggol Green Primary School"          
 [6] "Princess Elizabeth Primary School"     
 [7] "Westwood Primary School"               
 [8] "St. Hilda’s Primary School"            
 [9] "Catholic High School (Primary Section)"
[10] "Ai Tong School"                        

Next, I need to check whether the top 10 most popular primary schools can be found in the primary school data I extracted earlier. But before that, to make things consistent, I used the lapply() function to convert the school names picked from the website to uppercase.

Code
#make school names all uppercase
popular_primary_schools <- lapply(popular_primary_schools, toupper) 

# to check both primary school datasets matches
popular_primary_schools_sf <- primary_school_sf %>%
  filter(school_name %in% popular_primary_schools)
Code
nrow(popular_primary_schools_sf)
[1] 8

Based on the output above, out of the 10 primary schools, only 8 can be found. The code chunk below tells us which primary schools are missing.

Code
unique(popular_primary_schools[!(popular_primary_schools %in% popular_primary_schools_sf$school_name)])
[[1]]
[1] "ST. HILDA’S PRIMARY SCHOOL"

[[2]]
[1] "CATHOLIC HIGH SCHOOL (PRIMARY SECTION)"

Looking further into our dataset, I found that the school names in the output above are written differently in the original primary school dataset. For example, CANOSSA CATHOLIC PRIMARY SCHOOL corresponds to CATHOLIC HIGH SCHOOL (PRIMARY SECTION). Thus, I decided to use the rbind() function to manually add both records to popular_primary_schools_sf.

Code
popular_primary_schools_sf <- popular_primary_schools_sf %>%
  rbind(primary_school_sf %>% filter(school_name == "CANOSSA CATHOLIC PRIMARY SCHOOL"))

popular_primary_schools_sf <- popular_primary_schools_sf %>%
  rbind(primary_school_sf %>% filter(school_name == "ST. HILDA'S PRIMARY SCHOOL"))
Code
nrow(popular_primary_schools_sf)
[1] 10

Shopping Mall

For shopping malls, I used the dataset extracted by Valery Lim, who web-scraped the shopping mall data from Wikipedia in 2019.

Code
shopping_mall <- read.csv("data/geospatial/mall_coordinates_updated.csv")

shopping_mall <- shopping_mall %>%
  select(name, latitude, longitude)

glimpse(shopping_mall)
Rows: 184
Columns: 3
$ name      <chr> "100 AM", "112 KATONG", "313@SOMERSET", "321 CLEMENTI", "600…
$ latitude  <dbl> 1.274588, 1.305087, 1.301385, 1.312025, 1.334042, 1.437131, …
$ longitude <dbl> 103.8435, 103.9051, 103.8377, 103.7650, 103.8510, 103.7953, …

Taking a glimpse at our data, there were a total of 184 shopping malls in 2019.

Next, the code chunk below transforms our data to the correct Singapore coordinate system.

Code
shopping_mall_sf <- st_as_sf(shopping_mall,
                        coords = c("longitude",
                                   "latitude"),
                        crs = 4326) %>%
  st_transform(crs = 3414)

Data Pre-processing

Same as before, we will conduct the following data pre-processing steps, with an additional step of removing irrelevant columns for certain datasets:

  • removing irrelevant columns

  • checking for missing values

  • checking for the correct coordinate system

  • checking for invalid geometries
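Since the same three checks are repeated for every dataset below, they can be bundled into a small helper. This is a sketch with a hypothetical function name (preprocess_checks is not part of any package), assuming the sf package is installed:

```r
# Hypothetical helper bundling the three pre-processing checks for one
# sf layer; assumes the sf package is installed.
preprocess_checks <- function(layer_sf) {
  list(
    n_missing = sum(is.na(layer_sf)),                          # missing values
    epsg      = sf::st_crs(layer_sf)$epsg,                     # should be 3414
    n_invalid = sum(!sf::st_is_valid(layer_sf), na.rm = TRUE)  # invalid geometries
  )
}

# usage (not run): preprocess_checks(childcare_sf)
```

Calling it once per layer would replace the repeated sum(is.na()), st_crs() and st_is_valid() chunks that follow.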

Remove Irrelevant Columns

For our datasets, we only need the name and the geometry, so in the following code chunks I will be removing the irrelevant columns.

Code
#mpsz
mpsz <- mpsz %>% select("Name")

#childcare_sf
#combine name and address postal to make it unique, some childcare have same name but diff locations
childcare_sf$full_address <- paste(childcare_sf$NAME, childcare_sf$ADDRESSP)
childcare_sf <- childcare_sf %>% select("full_address")

#eldercare_sf
eldercare_sf$full_address <- paste(eldercare_sf$NAME, eldercare_sf$ADDRESSPOSTALCODE)
eldercare_sf <- eldercare_sf %>% select("full_address")

#hawkercentre_new_sf
hawkercentre_new_sf <- hawkercentre_new_sf %>% select("NAME")

#kindergartens_sf
kindergartens_sf <- kindergartens_sf %>% select("NAME")

#nationalparks_sf
nationalparks_sf <- nationalparks_sf %>% select("NAME")

#supermarket
supermarket_sf <- supermarket_sf %>% select("Description")

#bus stop
bus_stop_sf$stop_name <- paste(bus_stop_sf$BUS_STOP_N, bus_stop_sf$BUS_ROOF_N, bus_stop_sf$LOC_DESC)
bus_stop_sf <- bus_stop_sf %>% select("stop_name")

#mrt
#combine stn name and exit to make each row unique
mrt_sf$stn <- paste(mrt_sf$stn_name, mrt_sf$exit_code)
mrt_sf <- mrt_sf %>% select("stn")

Check for Missing Values

Code
#mpsz
sum(is.na(mpsz))
[1] 0
Code
#childcare_sf
sum(is.na(childcare_sf))
[1] 0
Code
#eldercare_sf
sum(is.na(eldercare_sf))
[1] 0
Code
#hawkercentre_new_sf
sum(is.na(hawkercentre_new_sf))
[1] 0
Code
#kindergartens_sf
sum(is.na(kindergartens_sf))
[1] 0
Code
#nationalparks_sf
sum(is.na(nationalparks_sf))
[1] 0
Code
#supermarket_sf
sum(is.na(supermarket_sf))
[1] 0
Code
#bus stop
sum(is.na(bus_stop_sf))
[1] 0
Code
#MRT
sum(is.na(mrt_sf))
[1] 0
Code
#shopping_mall_sf
sum(is.na(shopping_mall_sf))
[1] 0
Code
#primary_school_sf
sum(is.na(primary_school_sf))
[1] 0
Code
#popular_primary_schools_sf
sum(is.na(popular_primary_schools_sf))
[1] 0

Based on the output, there are no missing values.

Check for Correct Coordinates System

Code
#mpsz
st_crs(mpsz)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#childcare_sf
st_crs(childcare_sf)
Coordinate Reference System:
  User input: SVY21 / Singapore TM 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#eldercare_sf
st_crs(eldercare_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#hawkercentre_new_sf
st_crs(hawkercentre_new_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#kindergartens_sf
st_crs(kindergartens_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#nationalparks_sf
st_crs(nationalparks_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#supermarket_sf
st_crs(supermarket_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#bus_stop_sf
st_crs(bus_stop_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#mrt_sf
st_crs(mrt_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#shopping_mall_sf
st_crs(shopping_mall_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#primary_school_sf
st_crs(primary_school_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code
#popular_primary_schools_sf
st_crs(popular_primary_schools_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

Based on the output, all of our datasets are in the correct projected coordinate system, SVY21 / Singapore TM (EPSG:3414).
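Had any layer been in a different CRS (e.g. WGS 84 from a .kml file), we could reproject it with st_transform() before continuing. A minimal sketch, using a hypothetical example_sf object:

```r
# Hypothetical example: reproject a layer from WGS 84 (EPSG:4326)
# to SVY21 / Singapore TM (EPSG:3414) so all layers share one CRS.
example_sf <- st_transform(example_sf, crs = 3414)
st_crs(example_sf)$epsg  # should now report 3414
```

Keeping every layer in the same projected CRS is essential here, since the distance calculations later on assume coordinates in metres.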

Check for Invalid Geometries

Code
#mpsz
length(which(st_is_valid(mpsz) == FALSE))
[1] 9
Code
#childcare_sf
length(which(st_is_valid(childcare_sf) == FALSE))
[1] 0
Code
#eldercare_sf
length(which(st_is_valid(eldercare_sf) == FALSE))
[1] 0
Code
#kindergartens_sf
length(which(st_is_valid(kindergartens_sf) == FALSE))
[1] 0
Code
#hawkercentre_new_sf
length(which(st_is_valid(hawkercentre_new_sf) == FALSE))
[1] 0
Code
#nationalparks_sf
length(which(st_is_valid(nationalparks_sf) == FALSE))
[1] 0
Code
#supermarket_sf
length(which(st_is_valid(supermarket_sf) == FALSE))
[1] 0
Code
#bus_stop_sf
length(which(st_is_valid(bus_stop_sf) == FALSE))
[1] 0
Code
#mrt_sf
length(which(st_is_valid(mrt_sf) == FALSE))
[1] 0
Code
#shopping_mall_sf
length(which(st_is_valid(shopping_mall_sf) == FALSE))
[1] 0
Code
#primary_school_sf
length(which(st_is_valid(primary_school_sf) == FALSE))
[1] 0
Code
#popular_primary_schools_sf
length(which(st_is_valid(popular_primary_schools_sf) == FALSE))
[1] 0

Based on the output, we can see that only the mpsz dataset has invalid geometries (9 of them). The code chunk below addresses this issue.

Code
mpsz <- st_make_valid(mpsz)
length(which(st_is_valid(mpsz) == FALSE))
[1] 0

Visualisations

The following code chunks show some visualisations of the data we have just preprocessed.

Subzone Boundary of Singapore 2014
Code
plot(st_geometry(mpsz))

MRT/LRT Stations Map
Code
tmap_mode("view")
tm_shape(mrt_sf) +
  tm_dots(col="red", size=0.05) +
  tm_view(set.zoom.limits = c(11, 14))
Bus Map
Code
tmap_mode("plot")
tm_shape(mpsz) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(bus_stop_sf) +
  tm_dots(col="red", size=0.05) +
  tm_layout(main.title = "Bus Stops",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)

Resale Flat Prices
Code
tmap_mode("view")
tmap_options(check.and.fix = TRUE)
tm_shape(resale_sf) +  
  tm_dots(col = "resale_price",
          alpha = 0.6,
          style="quantile") +
  # sets minimum zoom level to 11, sets maximum zoom level to 14
  tm_view(set.zoom.limits = c(11,14))

We can observe that, from July 2022 to December 2022, the majority of the 5-room HDB flats can be found in the eastern side of Singapore, and the higher-priced flats tend to be concentrated in the south-eastern side.

Code
tmap_mode("plot")

Formulated Functions

Proximity Distance Calculations

As per our task, we need to find the proximity of certain facilities, i.e. proximity to the CBD, eldercare services, hawker centres, MRT stations, parks, good primary schools, shopping malls and supermarkets. This can be computed by creating a function called proximity() which uses st_distance() to build a distance matrix and rowMins() from the matrixStats package to find the closest facility (shortest distance). The values are then appended to resale_sf as new columns.

Code
proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units()
  df1[,varname] <- rowMins(dist_matrix)
  return(df1)
}
Code
resale_sf<-readRDS("data/rds/resale_sf.rds")
Code
#CBD, eldercare, hawker centres, MRT, park, good primary school, shopping mall and supermarket.
resale_sf <- 
  proximity(resale_sf, cbd_sf, "PROX_CBD") %>%
  proximity(., eldercare_sf, "PROX_ELDERCARE") %>%
  proximity(., hawkercentre_new_sf, "PROX_HAWKER") %>%
  proximity(., mrt_sf, "PROX_MRT") %>%
  proximity(., nationalparks_sf, "PROX_PARK") %>%
  proximity(., popular_primary_schools_sf, "PROX_GDPRISCH") %>%
  proximity(., shopping_mall_sf, "PROX_SHOPPINGMALL") %>%
  proximity(., supermarket_sf, "PROX_SPRMKT")

Facility Count within Radius Calculations

As per our task, we also want to find the number of facilities within a particular radius. Thus, we create another function called num_radius() which uses st_distance() to compute the distances between the flats and the desired facilities, and then counts the observations within the radius with rowSums(). The values are appended to the data frame as a new column.

Code
num_radius <- function(df1, df2, varname, radius) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units() %>%
    as.data.frame()
  df1[,varname] <- rowSums(dist_matrix <= radius)
  return(df1)
}
Code
#Numbers of kindergartens within 350m, Numbers of childcare centres within 350m, Numbers of bus stop within 350m, Numbers of primary school within 1km
resale_sf <- 
  num_radius(resale_sf, kindergartens_sf, "NUM_KNDRGTN", 350) %>%
  num_radius(., childcare_sf, "NUM_CHILDCARE", 350) %>%
  num_radius(., bus_stop_sf, "NUM_BUS_STOP", 350) %>%
  num_radius(., primary_school_sf, "NUM_PRISCH", 1000)

Once again, I will save the latest version of resale_sf to an rds file.

Code
write_rds(resale_sf, "data/rds/resale_sf_complete.rds")
Code
glimpse(readRDS("data/rds/resale_sf_complete.rds"))
Rows: 2,491
Columns: 25
$ month               <chr> "2022-10", "2022-10", "2022-10", "2022-10", "2022-…
$ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type           <chr> "5 ROOM", "5 ROOM", "5 ROOM", "5 ROOM", "5 ROOM", …
$ block               <chr> "306", "310A", "551", "551", "431", "501", "253", …
$ street_name         <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 1", "ANG MO KI…
$ storey_range        <chr> "10 TO 12", "13 TO 15", "01 TO 03", "16 TO 18", "2…
$ floor_area_sqm      <dbl> 123, 119, 118, 118, 119, 121, 137, 110, 110, 110, …
$ flat_model          <chr> "Standard", "Improved", "Improved", "Improved", "I…
$ lease_commence_date <dbl> 46, 11, 42, 42, 44, 42, 27, 22, 22, 22, 5, 5, 5, 2…
$ remaining_lease     <chr> "53.83", "89", "57.33", "57.33", "55.42", "57.33",…
$ resale_price        <dbl> 700000, 980000, 568000, 603000, 720000, 670000, 80…
$ storey_order        <int> 4, 5, 1, 6, 9, 3, 1, 7, 1, 5, 2, 9, 7, 8, 7, 4, 3,…
$ geometry            <POINT [m]> POINT (29383.53 38640.51), POINT (29171.38 3…
$ PROX_CBD            <dbl> 8625.861, 8544.547, 9537.543, 9537.543, 8876.533, …
$ PROX_ELDERCARE      <dbl> 211.9637, 143.8115, 1064.6617, 1064.6617, 356.4709…
$ PROX_HAWKER         <dbl> 331.2628, 496.9721, 482.8156, 482.8156, 375.3304, …
$ PROX_MRT            <dbl> 573.2417, 797.7411, 1080.8607, 1080.8607, 365.2200…
$ PROX_PARK           <dbl> 554.3174, 593.1602, 735.9373, 735.9373, 390.1731, …
$ PROX_GDPRISCH       <dbl> 1526.5641, 1292.1608, 3212.9380, 3212.9380, 2376.2…
$ PROX_SHOPPINGMALL   <dbl> 490.9497, 708.9643, 1213.2871, 1213.2871, 514.1228…
$ PROX_SPRMKT         <dbl> 246.5007, 313.5439, 419.9139, 419.9139, 313.5926, …
$ NUM_KNDRGTN         <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,…
$ NUM_CHILDCARE       <dbl> 6, 2, 3, 3, 4, 1, 3, 4, 4, 3, 4, 4, 4, 4, 6, 5, 8,…
$ NUM_BUS_STOP        <dbl> 6, 6, 2, 2, 6, 9, 11, 6, 6, 6, 6, 8, 8, 7, 7, 12, …
$ NUM_PRISCH          <dbl> 2, 2, 1, 1, 3, 1, 3, 3, 3, 3, 2, 3, 3, 2, 2, 4, 4,…

Train-Test Split

As earlier mentioned, for the resale dataset,

  • Training dataset period: October 2022 to December 2022

  • Test dataset period: January 2023 to February 2023

Hence, the code chunk below reads the rds file resale_sf_complete which we created earlier on and splits the dataset accordingly.

Code
resale_sf_complete<-readRDS("data/rds/resale_sf_complete.rds")
Code
resale_train <- resale_sf_complete %>%
  filter(month >= "2022-10" & month <= "2022-12",
         flat_type == "5 ROOM") 
Code
resale_test <- resale_sf_complete %>%
  filter(month >= "2023-01" & month <= "2023-02",
         flat_type == "5 ROOM") 

For future reference, I will write the train and test datasets of resale_sf as rds files.

Code
write_rds(resale_train, "data/model/resale_train.rds")
write_rds(resale_test, "data/model/resale_test.rds")

The code chunk below reads our train and test dataset of resale_sf.

Code
train_data <- read_rds("data/model/resale_train.rds")
test_data <- read_rds("data/model/resale_test.rds")
Code
glimpse(train_data)
Rows: 1,496
Columns: 25
$ month               <chr> "2022-10", "2022-10", "2022-10", "2022-10", "2022-…
$ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type           <chr> "5 ROOM", "5 ROOM", "5 ROOM", "5 ROOM", "5 ROOM", …
$ block               <chr> "306", "310A", "551", "551", "431", "501", "253", …
$ street_name         <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 1", "ANG MO KI…
$ storey_range        <chr> "10 TO 12", "13 TO 15", "01 TO 03", "16 TO 18", "2…
$ floor_area_sqm      <dbl> 123, 119, 118, 118, 119, 121, 137, 110, 110, 110, …
$ flat_model          <chr> "Standard", "Improved", "Improved", "Improved", "I…
$ lease_commence_date <dbl> 46, 11, 42, 42, 44, 42, 27, 22, 22, 22, 5, 5, 5, 2…
$ remaining_lease     <chr> "53.83", "89", "57.33", "57.33", "55.42", "57.33",…
$ resale_price        <dbl> 700000, 980000, 568000, 603000, 720000, 670000, 80…
$ storey_order        <int> 4, 5, 1, 6, 9, 3, 1, 7, 1, 5, 2, 9, 7, 8, 7, 4, 3,…
$ geometry            <POINT [m]> POINT (29383.53 38640.51), POINT (29171.38 3…
$ PROX_CBD            <dbl> 8625.861, 8544.547, 9537.543, 9537.543, 8876.533, …
$ PROX_ELDERCARE      <dbl> 211.9637, 143.8115, 1064.6617, 1064.6617, 356.4709…
$ PROX_HAWKER         <dbl> 331.2628, 496.9721, 482.8156, 482.8156, 375.3304, …
$ PROX_MRT            <dbl> 573.2417, 797.7411, 1080.8607, 1080.8607, 365.2200…
$ PROX_PARK           <dbl> 554.3174, 593.1602, 735.9373, 735.9373, 390.1731, …
$ PROX_GDPRISCH       <dbl> 1526.5641, 1292.1608, 3212.9380, 3212.9380, 2376.2…
$ PROX_SHOPPINGMALL   <dbl> 490.9497, 708.9643, 1213.2871, 1213.2871, 514.1228…
$ PROX_SPRMKT         <dbl> 246.5007, 313.5439, 419.9139, 419.9139, 313.5926, …
$ NUM_KNDRGTN         <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,…
$ NUM_CHILDCARE       <dbl> 6, 2, 3, 3, 4, 1, 3, 4, 4, 3, 4, 4, 4, 4, 6, 5, 8,…
$ NUM_BUS_STOP        <dbl> 6, 6, 2, 2, 6, 9, 11, 6, 6, 6, 6, 8, 8, 7, 7, 12, …
$ NUM_PRISCH          <dbl> 2, 2, 1, 1, 3, 1, 3, 3, 3, 3, 2, 3, 3, 2, 2, 4, 4,…

Having a glimpse at our training and test data, I realised that some columns are irrelevant for our subsequent tasks, such as month, flat_model, town, flat_type, block, street_name and storey_range. Thus, I removed them.

Code
train_data <-subset(train_data, select = -c(month,flat_model, town,flat_type, block,street_name, storey_range ))

test_data <-subset(test_data, select = -c(month,flat_model, town,flat_type, block,street_name, storey_range ))

The lease_commence_date column actually holds the age of the unit which we calculated earlier on, but its name can be misleading, so I renamed it to age.

Code
colnames(train_data)[2] <- "age"
colnames(test_data)[2] <- "age"
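Renaming by column position works here, but it would silently rename the wrong column if the column order ever changed. As a note on the design choice, a more robust alternative (a sketch, assuming dplyr is loaded via tidyverse, and intended to replace rather than follow the positional rename above) renames by name instead:

```r
# Rename by name rather than position, so the code is robust
# to changes in column order.
train_data <- dplyr::rename(train_data, age = lease_commence_date)
test_data  <- dplyr::rename(test_data,  age = lease_commence_date)
```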

Looking at our data again, remaining_lease is stored as character when it should be numeric, so I converted both age and remaining_lease to numeric to be safe.

Code
train_data$age  <- as.numeric(train_data$age)
test_data$age  <- as.numeric(test_data$age)

train_data$remaining_lease  <- as.numeric(train_data$remaining_lease)
test_data$remaining_lease  <- as.numeric(test_data$remaining_lease)

Finally, we are done processing our train and test data for our subsequent tasks!

The code chunk below shows a summary of our resale prices.

Code
summary(train_data$resale_price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 415000  570000  633000  669631  727250 1345000 

Computing Correlation Matrix

Before loading the predictors into a predictive model, it is always good practice to use a correlation matrix to examine whether there are signs of multicollinearity.

But before that, we need to drop our geometry values.

Code
train_nogeo <- train_data %>%
  st_drop_geometry()

The code chunk below saves the rds file for training data without geometry so we can use it later.

Code
write_rds(train_nogeo, "data/model/train_nogeo.rds")

The code chunk below computes the correlation matrix.

Code
corrplot::corrplot(cor(train_nogeo), 
                   diag = FALSE, 
                   order = "AOE",
                   tl.pos = "td", 
                   tl.cex = 0.5, 
                   method = "number", 
                   type = "upper")

In general, an absolute correlation coefficient above 0.7 between two or more predictors indicates the presence of multicollinearity. As we have many variables, the correlation values are hard to read, but most look acceptable except for age, whose correlation with remaining_lease is -1. This indicates that the age of the unit and the remaining lease are perfectly multicollinear, so we should remove either age or remaining_lease before fitting our linear regression model.
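This perfect collinearity makes sense by construction, since the unit's age and its remaining lease together make up the fixed lease term, and it can be confirmed directly. A quick sketch:

```r
# The correlation should be exactly (or very nearly) -1, since
# age + remaining_lease is effectively the fixed 99-year lease term.
cor(train_nogeo$age, train_nogeo$remaining_lease)
```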

The code chunk below removes the age of unit from our dataset.

Code
#removes age column
train_nogeo<-train_nogeo[,-2]

Another way to check for multicollinearity is to look at the Variance Inflation Factor (VIF), which we will do later.

Multiple Linear Regression Model (OLS)

First, let’s build our multiple linear regression model, where the dependent variable y is resale_price and the independent variables are all remaining predictors.

Code
mlr1 <- lm(resale_price ~. ,
                data = train_nogeo)

summary(mlr1)

Call:
lm(formula = resale_price ~ ., data = train_nogeo)

Residuals:
    Min      1Q  Median      3Q     Max 
-236572  -46059   -4003   39815  418533 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -2.524e+05  5.764e+04  -4.379 1.28e-05 ***
floor_area_sqm     6.240e+03  3.570e+02  17.477  < 2e-16 ***
remaining_lease    5.952e+03  2.337e+02  25.473  < 2e-16 ***
storey_order       2.131e+04  1.029e+03  20.697  < 2e-16 ***
PROX_CBD          -2.227e+01  6.755e-01 -32.962  < 2e-16 ***
PROX_ELDERCARE     1.160e+01  3.172e+00   3.658 0.000263 ***
PROX_HAWKER       -2.465e+01  3.581e+00  -6.883 8.61e-12 ***
PROX_MRT          -2.826e+01  5.621e+00  -5.027 5.59e-07 ***
PROX_PARK          1.900e+01  4.351e+00   4.367 1.35e-05 ***
PROX_GDPRISCH     -4.721e-01  1.276e+00  -0.370 0.711487    
PROX_SHOPPINGMALL -4.620e+00  6.324e+00  -0.731 0.465180    
PROX_SPRMKT        1.459e+01  1.373e+01   1.062 0.288234    
NUM_KNDRGTN        1.154e+04  2.038e+03   5.663 1.78e-08 ***
NUM_CHILDCARE     -4.383e+03  1.040e+03  -4.216 2.63e-05 ***
NUM_BUS_STOP       5.283e+02  6.902e+02   0.765 0.444202    
NUM_PRISCH        -9.436e+03  1.412e+03  -6.681 3.36e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 75210 on 1480 degrees of freedom
Multiple R-squared:  0.7083,    Adjusted R-squared:  0.7054 
F-statistic: 239.6 on 15 and 1480 DF,  p-value: < 2.2e-16

Based on the output, the adjusted R-squared is 0.7054, which indicates that roughly 70% of the variance in the dependent variable (resale price) is explained by the independent variables. Most of the predictors are significant (p-value < 0.05) except for PROX_SPRMKT, PROX_SHOPPINGMALL, PROX_GDPRISCH and NUM_BUS_STOP.
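These figures can also be pulled out of the fitted model programmatically rather than read off the printed summary; a small sketch using base R:

```r
# Extract the adjusted R-squared and flag predictors whose
# p-value exceeds the conventional 0.05 threshold.
s <- summary(mlr1)
s$adj.r.squared
rownames(coef(s))[coef(s)[, "Pr(>|t|)"] > 0.05]
```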

Test of multicollinearity (VIF)

We can use the ols_vif_tol() function of the olsrr package to calculate the VIF. In general, if the VIF value is less than 5, there is usually no sign of problematic correlation among the predictors.

Code
ols_vif_tol(mlr1)
           Variables Tolerance      VIF
1     floor_area_sqm 0.5490904 1.821194
2    remaining_lease 0.4939016 2.024695
3       storey_order 0.8716868 1.147201
4           PROX_CBD 0.4984548 2.006200
5     PROX_ELDERCARE 0.7790581 1.283601
6        PROX_HAWKER 0.7879037 1.269191
7           PROX_MRT 0.8119679 1.231576
8          PROX_PARK 0.8281714 1.207480
9      PROX_GDPRISCH 0.7162116 1.396235
10 PROX_SHOPPINGMALL 0.7027136 1.423055
11       PROX_SPRMKT 0.8063695 1.240126
12       NUM_KNDRGTN 0.7460480 1.340396
13     NUM_CHILDCARE 0.6323711 1.581350
14      NUM_BUS_STOP 0.8319573 1.201985
15        NUM_PRISCH 0.6071676 1.646992

The higher the VIF, the less reliable our regression results will be. Based on the output, there are no signs of multicollinearity, as all VIF values are below 5.

The code chunk below saves the results of our model to an rds file.

Code
write_rds(mlr1, "data/model/mlr1.rds" ) 

GWR Predictive Method

In this section, we will calibrate our mlr1 model to predict HDB resale prices.

Converting the sf data.frame to a SpatialPointsDataFrame

Code
train_data_sp <- as_Spatial(train_data[,-2])
train_data_sp
class       : SpatialPointsDataFrame 
features    : 1496 
extent      : 6958.193, 42645.18, 28157.26, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 16
names       : floor_area_sqm, remaining_lease, resale_price, storey_order,         PROX_CBD,      PROX_ELDERCARE,      PROX_HAWKER,         PROX_MRT,        PROX_PARK,    PROX_GDPRISCH,   PROX_SHOPPINGMALL,         PROX_SPRMKT, NUM_KNDRGTN, NUM_CHILDCARE, NUM_BUS_STOP, ... 
min values  :            104,           49.75,       415000,            1,  1610.9563636452, 0.00121769219557756, 49.4878683203785, 27.3898687165439, 60.0396043524142, 115.665591934146, 0.00118710679792838, 0.00109267327868705,           0,             0,            0, ... 
max values  :            149,           95.33,      1345000,           14, 23277.3731998479,     8265.9709141931, 6633.82585101601, 2070.17690667209, 6003.01683926018, 9048.73399591188,    4009.20093399522,    1451.97426427893,           7,            19,           19, ... 

Computing adaptive bandwidth

Code
bw_adaptive <- bw.gwr(resale_price ~.,
                  data = train_data_sp,
                  approach="CV",
                  kernel="gaussian",
                  adaptive=TRUE,
                  longlat=FALSE
                  )

Using cross-validation with a Gaussian kernel, the smallest CV score found is 3.054284e+12, which corresponds to an adaptive bandwidth of 22 nearest neighbours.
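For intuition, a Gaussian kernel assigns each neighbour a weight that decays smoothly with distance from the regression point; with an adaptive bandwidth, the bandwidth is the distance to the k-th (here, 22nd) nearest neighbour. A minimal sketch of the weighting function (illustrative only, not the GWmodel internals):

```python
import math

def gaussian_weight(d, b):
    """Gaussian kernel weight for a neighbour at distance d,
    given bandwidth b (e.g. the distance to the k-th nearest neighbour)."""
    return math.exp(-0.5 * (d / b) ** 2)

print(gaussian_weight(0, 1000))               # 1.0: the regression point itself
print(round(gaussian_weight(1000, 1000), 4))  # ~0.6065 at one bandwidth away
```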

Next, we will save the adaptive bandwidth as an rds file.

Code
write_rds(bw_adaptive, file = "data/model/bw_adaptive.rds")

Constructing the adaptive bandwidth GWR model

First, let’s load the saved bandwidth by using the code chunk below.

Code
bw_adaptive <- read_rds("data/model/bw_adaptive.rds")

Now, we can calibrate the GWR-based hedonic pricing model using the adaptive bandwidth and a Gaussian kernel, as shown in the code chunk below.

Code
gwr_adaptive <- gwr.basic(formula = resale_price ~.,
                          data=train_data_sp,
                          bw=bw_adaptive, 
                          kernel = 'gaussian', 
                          adaptive=TRUE,
                          longlat = FALSE)

The code chunk below will be used to save the model in rds format for future use.

Code
write_rds(gwr_adaptive, "data/model/gwr_adaptive.rds")

The code below can be used to display the model output.

Code
gwr_adaptive <- read_rds("data/model/gwr_adaptive.rds")
gwr_adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2023-03-26 19:54:55 
   Call:
   gwr.basic(formula = resale_price ~ ., data = train_data_sp, bw = bw_adaptive, 
    kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  resale_price
   Independent variables:  .
   Number of data points: 1496
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-232352  -45529   -4238   37272  385542 

   Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
   (Intercept)       -1.973e+05  5.911e+04  -3.338 0.000863 ***
   floor_area_sqm     6.251e+03  3.568e+02  17.518  < 2e-16 ***
   remaining_lease    6.167e+03  2.397e+02  25.730  < 2e-16 ***
   storey_order       2.088e+04  1.038e+03  20.112  < 2e-16 ***
   PROX_CBD          -1.891e+01  1.077e+00 -17.560  < 2e-16 ***
   PROX_ELDERCARE     1.140e+01  3.175e+00   3.589 0.000342 ***
   PROX_HAWKER       -2.752e+01  3.720e+00  -7.398 2.30e-13 ***
   PROX_MRT          -3.121e+01  5.644e+00  -5.530 3.79e-08 ***
   PROX_PARK          2.249e+01  4.435e+00   5.070 4.48e-07 ***
   PROX_GDPRISCH      1.427e+00  1.433e+00   0.996 0.319346    
   PROX_SHOPPINGMALL -6.769e+00  6.302e+00  -1.074 0.282938    
   PROX_SPRMKT        1.095e+01  1.375e+01   0.797 0.425714    
   NUM_KNDRGTN        1.036e+04  2.050e+03   5.054 4.86e-07 ***
   NUM_CHILDCARE     -4.090e+03  1.036e+03  -3.950 8.20e-05 ***
   NUM_BUS_STOP       1.515e+02  6.918e+02   0.219 0.826681    
   NUM_PRISCH        -8.990e+03  1.447e+03  -6.213 6.75e-10 ***
   coords.x1          7.400e-01  3.382e-01   2.188 0.028814 *  
   coords.x2         -3.416e+00  7.531e-01  -4.536 6.20e-06 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 74740 on 1478 degrees of freedom
   Multiple R-squared: 0.7123
   Adjusted R-squared: 0.709 
   F-statistic: 215.3 on 17 and 1478 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 8.257033e+12
   Sigma(hat): 74342.42
   AIC:  37841.04
   AICc:  37841.56
   BIC:  36584.84
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 22 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                            Min.     1st Qu.      Median     3rd Qu.       Max.
   Intercept         -8.5125e+08 -9.6843e+06  1.5185e+06  1.3389e+07 5.3217e+09
   floor_area_sqm    -4.0120e+03  2.5913e+03  3.9227e+03  5.9585e+03 1.5536e+06
   remaining_lease   -9.7555e+04  5.4572e+03  7.2246e+03  9.0025e+03 4.9399e+04
   storey_order       6.3018e+03  1.2182e+04  1.5164e+04  1.8695e+04 2.5594e+04
   PROX_CBD          -2.9220e+04 -5.6422e+02 -2.5133e+00  4.8308e+02 2.3844e+05
   PROX_ELDERCARE    -4.8216e+04 -1.2686e+01  1.9113e+01  6.4787e+01 1.2264e+03
   PROX_HAWKER       -1.3614e+03 -4.4118e+01 -6.8652e+00  3.3078e+01 1.1669e+04
   PROX_MRT          -1.4344e+04 -9.4417e+01 -5.0592e+01 -4.7851e+00 8.5307e+02
   PROX_PARK         -6.4812e+02 -4.2881e+01 -9.2976e+00  2.3041e+01 1.5562e+03
   PROX_GDPRISCH     -3.9664e+03 -3.6373e+01  5.5693e+00  5.4074e+01 8.6887e+03
   PROX_SHOPPINGMALL -7.0975e+02 -6.6587e+01 -2.0800e+01  2.1314e+01 1.1766e+04
   PROX_SPRMKT       -1.0656e+03 -3.0144e+01  5.7481e+00  4.8561e+01 2.5178e+03
   NUM_KNDRGTN       -5.2077e+04 -3.3102e+03  3.0430e+03  1.2316e+04 1.7806e+05
   NUM_CHILDCARE     -5.1618e+04 -3.3261e+03  1.7252e+02  2.4240e+03 2.2769e+04
   NUM_BUS_STOP      -4.4570e+04 -8.6265e+02  1.3844e+03  3.6228e+03 2.4699e+04
   NUM_PRISCH        -6.5807e+04 -5.8145e+03  2.0226e+03  8.5195e+03 2.8605e+05
   coords.x1         -2.3903e+04 -3.5479e+02 -5.3459e+01  1.5550e+02 1.3272e+04
   coords.x2         -1.8906e+05 -3.2348e+02 -4.1838e+00  4.0785e+02 2.9058e+04
   ************************Diagnostic information*************************
   Number of data points: 1496 
   Effective number of parameters (2trace(S) - trace(S'S)): 552.7639 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 943.2361 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 36424.99 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 35569.1 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 36939.87 
   Residual sum of squares: 1.369059e+12 
   R-square value:  0.9523041 
   Adjusted R-square value:  0.9243233 

   ***********************************************************************
   Program stops at: 2023-03-26 19:54:57 

From the output above, we can observe that the GWR model’s adjusted R-square is 0.9243233, which is noticeably higher than the non-spatial multiple linear regression model’s adjusted R-square of 0.709. Since R-squared measures the goodness of fit of a regression model, the higher value suggests that the GWR model is a considerably better fit for the data.

The GWR model also has a lower AIC (35569.1) than the global regression model’s AIC (37841.04). A lower AIC indicates a better trade-off between goodness of fit and model complexity, since AIC penalises models that use more parameters.
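The size of this AIC gap can be put on an interpretable scale via the relative likelihood, exp((AIC_min − AIC)/2), which estimates how probable the weaker model is relative to the AIC-best model. A quick check (in Python, independent of the R workflow) shows the global model carries essentially zero support relative to the GWR model:

```python
import math

aic_gwr = 35569.1    # GWR model AIC, from the gwr.basic() output
aic_ols = 37841.04   # global regression AIC, from the same output

# Relative likelihood of the OLS model versus the AIC-best (GWR) model
rel_likelihood = math.exp((aic_gwr - aic_ols) / 2)
print(rel_likelihood)   # underflows to 0.0: negligible support for OLS
```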

Preparing coordinates data

Extracting coordinates data

The code chunk below extracts the x,y coordinates of the training and test data sets.

Code
coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)

Before continuing, we write the outputs to rds files for future use.

Code
coords_train <- write_rds(coords_train, "data/model/coords_train.rds" )
coords_test <- write_rds(coords_test, "data/model/coords_test.rds" )

Now, we need to retrieve our training data, which does not have the geometry.

Code
train_nogeo_rds<-readRDS("data/model/train_nogeo.rds")

Calibrating Random Forest Model

Now, we will calibrate a model to predict HDB resale price by using the random forest function of the ranger package. set.seed() is used to keep the results reproducible, since random forest involves random sampling. Random forest uses both bootstrap (row) sampling and feature (column) sampling, so each tree is fitted on a different set of data points and a different subset of predictors. As a result, random forest is largely unaffected by multicollinearity, which means that I do not need to remove the age variable.

Code
set.seed(1234)
rf <- ranger(resale_price ~.,
             data=train_nogeo_rds,
              num.trees = 30)
Code
print(rf)
Ranger result

Call:
 ranger(resale_price ~ ., data = train_nogeo_rds, num.trees = 30) 

Type:                             Regression 
Number of trees:                  30 
Sample size:                      1496 
Number of independent variables:  16 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       3246290896 
R squared (OOB):                  0.830922 

Based on the output, for a random forest of 30 trees, the OOB prediction error (MSE) is 3246290896 and the R squared (OOB) is 0.830922, which implies that the model explains around 83% of the variance in the training data (estimated out-of-bag).

Code
sqrt(rf$prediction.error)
[1] 56976.23

The random forest’s out-of-bag RMSE is 56976.23.
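This value is simply the square root of the OOB prediction error (MSE) reported above, expressed in the same units as resale_price (SGD). A quick check (in Python, independent of the R workflow):

```python
import math

oob_mse = 3246290896          # OOB prediction error (MSE) from ranger
oob_rmse = math.sqrt(oob_mse)
print(round(oob_rmse, 2))     # → 56976.23
```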

Calibrating a Geographically Weighted Random Forest

Next, we would like to determine the bandwidth for the Geographically Weighted Random Forest by using the grf.bw() function. As the grf.bw() function is computationally expensive, I will run it on a subset of the data (500 rows). Even though the bandwidth estimated from 500 rows may differ substantially from the bandwidth estimated from the whole dataset, approximating it on a subset is good enough as a proof of concept for this assignment.

Code
gwRF_bw <- grf.bw(formula = resale_price ~ 
                    age +
                    remaining_lease + 
                    floor_area_sqm +
                    storey_order +
                    PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
                    PROX_MRT + PROX_PARK + PROX_GDPRISCH + PROX_SPRMKT +
                    PROX_SHOPPINGMALL + 
                    NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_PRISCH,
                data = train_nogeo_rds[0:500,],
                kernel = "adaptive",
                trees = 30,
                coords = coords_train[0:500,]
                )

As the computation time is too long, I will use what has been generated so far. Based on the output, the best bandwidth is 85, with an R2 of 0.761592272965718.

Next, I used the bandwidth discovered above to calibrate the GWRF model, and then saved it into an rds file.

Code
set.seed(1234)
gwRF_adaptive <- grf(resale_price ~ 
                    age +
                    remaining_lease + 
                    floor_area_sqm + storey_order +
                    PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
                    PROX_MRT + PROX_PARK + PROX_GDPRISCH + PROX_SPRMKT +
                    PROX_SHOPPINGMALL + 
                    NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_PRISCH,
                     dframe=train_nogeo_rds, 
                     bw= 85,
                     ntree = 30,
                     kernel="adaptive",
                     coords=coords_train)
write_rds(gwRF_adaptive, "data/model/gwRF_adaptive.rds")
Code
#write_rds(gwRF_adaptive, "data/model/gwRF_adaptive.rds")
gwRF_adaptive <- read_rds("data/model/gwRF_adaptive.rds")
Code
#look at important variables
glimpse(gwRF_adaptive$Local.Variable.Importance)
Rows: 1,496
Columns: 16
$ age               <dbl> 320942757446, 448629756899, 647513499118, 7516663368…
$ remaining_lease   <dbl> 657786726120, 422099032262, 492218074216, 5226643315…
$ floor_area_sqm    <dbl> 41006992058, 30519375720, 105074950776, 185876407538…
$ storey_order      <dbl> 110960855171, 104047373638, 101896718991, 8783605220…
$ PROX_CBD          <dbl> 140999708606, 102437070763, 204273123777, 9483130917…
$ PROX_ELDERCARE    <dbl> 73215701479, 53718852857, 351150757572, 246294051248…
$ PROX_HAWKER       <dbl> 105058129002, 155231371012, 239672497719, 1388072614…
$ PROX_MRT          <dbl> 103492994640, 59711928524, 177245309132, 14323796576…
$ PROX_PARK         <dbl> 37696197821, 42225183408, 69473008268, 80618167170, …
$ PROX_GDPRISCH     <dbl> 96507080949, 102504561673, 179261581918, 25158876663…
$ PROX_SPRMKT       <dbl> 37300420254, 242577557255, 23842728322, 45316038072,…
$ PROX_SHOPPINGMALL <dbl> 57409486399, 47797347471, 29974885175, 38763840047, …
$ NUM_KNDRGTN       <dbl> 53090623507, 55873222840, 10019759988, 12115004309, …
$ NUM_CHILDCARE     <dbl> 10102173169, 18880970827, 6813724611, 17966864708, 9…
$ NUM_BUS_STOP      <dbl> 28784760335, 33066494085, 7765459152, 10159718996, 1…
$ NUM_PRISCH        <dbl> 95700047514, 13935912388, 5150098017, 12549017005, 6…
Code
gwRF_adaptive$Global.Model
Ranger result

Call:
 ranger(resale_price ~ age + remaining_lease + floor_area_sqm +      storey_order + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +      PROX_MRT + PROX_PARK + PROX_GDPRISCH + PROX_SPRMKT + PROX_SHOPPINGMALL +      NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_PRISCH,      data = train_nogeo_rds, num.trees = 30, mtry = 5, importance = "impurity",      num.threads = NULL) 

Type:                             Regression 
Number of trees:                  30 
Sample size:                      1496 
Number of independent variables:  16 
Mtry:                             5 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       3204428693 
R squared (OOB):                  0.8331023 

The output shows that the R2 (OOB) of the underlying global random forest is 0.8331023, for a GWRF model calibrated with 30 trees and a bandwidth of 85.

Predicting by using Test Data

Multiple Linear Regression (OLS)

The code chunk below uses the predict.lm() function of the stats package to run inference on our test data and saves the result in rds format.

Code
ols_pred <- predict.lm(mlr1, test_data[,-2]) %>%
  write_rds("data/model/ols_pred.rds")

Random Forest

The code chunk below uses the predict() function of the ranger package to run inference on our test data and saves the result in rds format.

Code
test_nogeo <-test_data  %>%
  st_drop_geometry()

rf_pred <- predict(rf, test_nogeo) %>%
  write_rds("data/model/rf_pred.rds")

GWRF model

The code chunk below will be used to combine the test data with its corresponding coordinates data.

Code
test_data <- cbind(test_data, coords_test) %>%
  st_drop_geometry()

Next, predict.grf() of the SpatialML package will be used to predict the resale value using the test data and the gwRF_adaptive model calibrated earlier.

Code
set.seed(1234)
gwRF_pred <- predict.grf(gwRF_adaptive, 
                           test_data, 
                           x.var.name="X",
                           y.var.name="Y", 
                           local.w=1,
                           global.w=0)

Before moving on, let us save the output into an rds file for future use.

Code
GRF_pred <- write_rds(gwRF_pred, "data/model/GRF_pred.rds")

The output of predict.grf() is a vector of predicted values. It is more convenient to convert it into a data frame for further visualisation and analysis.

Code
GRF_pred <- read_rds("data/model/GRF_pred.rds")
GRF_pred_df <- as.data.frame(GRF_pred)

In the code chunk below, cbind() is used to append the GRF predicted values onto test_data.

Code
#for gwrf
test_data_p <- cbind(test_data, GRF_pred_df)
Code
write_rds(test_data_p, "data/model/test_data_p.rds")

Models Evaluation

To evaluate the 3 models, we combine the actual resale prices and the predicted prices from each model into a single data frame.

Code
ols_pred_df <- read_rds("data/model/ols_pred.rds") %>%
  as.data.frame()
colnames(ols_pred_df) <- c("ols_pred")

rf_pred_df <- read_rds("data/model/rf_pred.rds")$predictions %>%
  as.data.frame()
colnames(rf_pred_df) <- c("rf_pred")

gwRF_pred_df <-  GRF_pred %>%
  as.data.frame()
colnames(gwRF_pred_df) <- c("gwrf_pred")

prices_pred_df <- cbind(test_data$resale_price, ols_pred_df, rf_pred_df,
                        gwRF_pred_df) %>% 
  rename("actual_price" = "test_data$resale_price")

head(prices_pred_df)
  actual_price ols_pred  rf_pred gwrf_pred
1       682888 728585.5 769835.8  728135.3
2       695000 643365.4 660018.7  698225.9
3       658888 686182.7 754647.1  689452.9
4       790000 789026.6 763765.2  856222.9
5       800000 704982.1 762599.3  782600.2
6       858000 831632.1 755024.8  827002.7

Looking at the results, the GWRF model’s predictions are generally the closest to the actual prices, compared to the other models.

Calculating Root Mean Square Error

The root mean square error (RMSE) allows us to measure how far predicted values are from observed values in a regression analysis.
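As a concrete sketch, RMSE is the square root of the mean squared difference between the actual and predicted values. The snippet below (in Python, independent of the R workflow) illustrates the calculation; the predicted values are made up for illustration only:

```python
import math

def rmse(actual, predicted):
    """Root mean square error of paired actual/predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

# Toy example: actual prices from the table above, hypothetical predictions
actual = [682888, 695000, 658888]
predicted = [700000, 690000, 660000]
print(round(rmse(actual, predicted), 2))
```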

Multiple Linear Regression (OLS)

Code
#lm
#MSE <- mean((test_data$resale_price - lm_predicted_value)^2)
#rmse_lm <- sqrt(MSE)

sqrt(mean((prices_pred_df$actual_price - prices_pred_df$ols_pred)^2))
[1] 69388.4

Random Forest

Code
sqrt(mean((prices_pred_df$actual_price - prices_pred_df$rf_pred)^2))
[1] 55187.82

GWRF model

In the code chunk below, rmse() of Metrics package is used to compute the RMSE for GRF model.

Code
#grf
rmse(test_data_p$resale_price, 
     test_data_p$GRF_pred)
[1] 53185.72
Code
#sqrt(mean((prices_pred_df$actual_price - prices_pred_df$gwrf_pred)^2))

Comparing the 3 models’ RMSE, the GWRF model has the lowest RMSE of 53185.72, compared to the OLS model’s RMSE of 69388.4 and the random forest model’s RMSE of 55187.82. RMSE measures the average difference between a model’s predicted values and the actual values, and hence how accurately the model predicts the target. Among the 3 models, the GWRF model is therefore the best at predicting resale prices.

Visualising the Predicted Values

Code
test_data_models <- cbind(test_data, prices_pred_df)

gwr_plot <- ggplot(data = test_data_p,
       aes(x = GRF_pred,
           y = resale_price)) +
  geom_point()+
  geom_abline(col = "Red")


lm_plot <- ggplot(data = test_data_models,
       aes(x = ols_pred,
           y = actual_price)) +
  geom_point()+
  geom_abline(col = "Red")

rf_plot <- ggplot(data = test_data_models,
       aes(x = rf_pred,
           y = actual_price)) +
  geom_point()+
  geom_abline(col = "Red")

plot_grid(gwr_plot, lm_plot, rf_plot, labels = "AUTO")

Comparing the 3 plots, the GWRF model’s points cluster more tightly around the diagonal, whereas for the OLS and random forest models, many of the points in the right half of each plot sit above the red line (i.e. higher-priced flats are under-predicted). This again suggests that the GWRF model outperforms the OLS and random forest models, as its scatter points lie closer to the diagonal line.

Conclusion

In conclusion, by comparing the 3 models’ performance at predicting actual resale prices, through both RMSE and the scatterplots, we can determine that the GWRF model is the best model for predicting resale prices. One reason it performs better is that geographically weighted methods extend ordinary least squares regression, adding a level of modelling sophistication by allowing the relationships between the independent and dependent variables to vary by locality.

Acknowledgment

I would like to thank Prof. Kam for the IS415 Geospatial Analytics and Applications course materials and resources. I would also like to thank my seniors, Megan Sim and Nor Aisyah Binte Ajit, whose code I referenced.